FLAT-LLM: Fine-grained Low-rank Activation Space Transformation for Large Language Model Compression
A Fast, Training-free Approach to Efficient LLM Deployment
Authors: J. Tian et al.
Published on Arxiv: 2025-05-29
Link: http://arxiv.org/abs/2505.23966v1
Institutions: University of California, Santa Barbara • Intel Corporation
Keywords: large language models, model compression, low-rank decomposition, principal component analysis, structural pruning, attention mechanisms, importance ranking, inference speedup, activation space transformation, WikiText-2, Llama-2, Mistral-7B, SVD-LLM, SliceGPT
Recent advances in Large Language Models (LLMs) have propelled state-of-the-art performance in Natural Language Processing, yet these models are difficult to deploy in resource-limited settings due to their high computational and memory requirements. Model compression—through strategies such as quantization, knowledge distillation, pruning, and particularly low-rank decomposition—has emerged as an essential direction, though most existing methods bring accuracy decline and inefficient architectures.
To address these challenges, the authors propose a novel solution. Their approach and main contributions are summarized as follows:
- FLAT-LLM is a training-free, fine-grained structural compression technique for LLMs, built around low-rank transformations in the activation space.
- Utilizes head-wise Principal Component Analysis (PCA) to reduce hidden dimensions via truncated eigenvectors, focusing on value and output projections within attention layers.
- Introduces an importance-based rank allocation algorithm that adaptively assigns low-rank ratios across decoder layers, preserving salient information while maximizing compression.
- Circumvents prior inefficiencies and excess memory use (from adapter modules), achieving significant weight compression without fine-tuning; calibration takes mere minutes.
- Validated experimentally on four LLMs (Llama-2 7B, 13B, 70B, Mistral-7B) and 11 datasets across multiple NLP tasks, with comprehensive baseline comparisons.
Following from the methodological innovations, the main findings and quantitative results include:
- FLAT-LLM delivers the lowest perplexity among all baselines at 10–50% compression on WikiText-2 (e.g. 240% improvement over SVD-LLM at 50% compression).
- Consistently best or highly competitive accuracy is achieved on diverse downstream tasks (ARC-e, ARC-c, PIQA, WinoGrande, HellaSwag, BoolQ, OBQA, MathQA, CommonsenseQA, MMLU), with only modest drops even at high compression ratios.
- Inference is up to 30% faster than SVD-LLM and SliceGPT, despite no CUDA-specific optimization.
- Calibration time is minimal (around 15 minutes), and the method is robust to different calibration sets and task types.
- The importance-preserving rank selection (IPRS) surpasses uniform allocation, especially at high compression ratios.
Based on these results, the authors conclude with the following key points:
- FLAT-LLM offers a rapid, accurate, and training-free compression method that leverages low-rank, fine-grained activation transformations and adaptive rank allocation.
- It achieves strong generalization, competitive generation quality, and notable inference speedups versus prior model decomposition and pruning techniques, with negligible calibration overhead.
- Some limitations remain: the method primarily targets value and output projections (not query/key projections) and may show performance drops at very high compression ratios; future work aims to address these.